Goal

Identify characteristics of individual Soldiers that contribute to unit performance

Why

Key to extending work to unit performance without requiring unit-level measurement

Method

Document Analysis

Document analysis is a “form of qualitative research in which documents are interpreted by the researcher to give voice and meaning around an assessment topic” (Bowen, 2009).

Our Approach: A Mixed Method Thematic Approach

Natural Language Processing

Natural language processing (NLP) "strives to build machines that understand and respond to text data much the same way humans do" (IBM, 2022)

All NLP for this project was run on a corpus composed of the 10 documents on which document analysis was conducted. The documents were cleaned by removing stop words (i.e., "the," "and," "of") and lemmatizing each word. Lemmatizing transforms each word into a common base form.

Co-occurrence

Co-occurrence analysis visualizes the word pairs that appear together most frequently in the corpus.

Term Frequency-Inverse Document Frequency

TF-IDF is a term-weighting statistic that reflects the importance of a word to a document within a corpus of documents.
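As a rough illustration of the calculation (the project's analysis was done in R; this is a Python sketch with toy documents and an unsmoothed IDF):

```python
import math

def tf_idf(word, doc, corpus):
    """Score `word` in `doc` relative to `corpus` (a list of token lists).

    Term frequency: share of the document's tokens that are `word`.
    Inverse document frequency: log of (number of documents /
    number of documents containing `word`), without smoothing.
    """
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in corpus if word in d)
    return tf * math.log(len(corpus) / df)

corpus = [
    ["unit", "cohesion", "unit"],   # toy document 1
    ["unit", "morale"],             # toy document 2
]
# "unit" appears in every document, so its IDF (and score) is 0;
# "cohesion" is unique to document 1, so it scores higher there.
print(tf_idf("unit", corpus[0], corpus))       # 0.0
print(tf_idf("cohesion", corpus[0], corpus))
```

Words that appear in every document score zero, which is exactly why the highest-scoring terms are the ones that make a document distinctive.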

Latent Dirichlet Allocation

LDA is a Bayesian topic model that discovers the topics in a corpus of documents and the probability that each word occurs within each topic.

Qualitative Analysis Data

  • Link the PDF documents in GitHub
  • Explain each document
  • Include the Excel file of excerpts if it is in GitHub
  • Include screenshots from Dedoose
Excerpt Count

The excerpt count per document is shown above.

Natural Language Processing Data

  • Include our code
  • Include charts from GitHub
  • Link to the GitHub repository
  • Skylar explains how the code works and how it actually measures what we claim

QA

  • Chart of codes
  • Explain the chart

NLP

Co-Occurrence

Topic Modeling

TF-IDF

Term Frequency-Inverse Document Frequency

Flowchart

1. Document Discovery

A list of 55 documents from the Army and external sources pertaining to the role of the individual in unit performance was compiled. The author, keywords, uses, a description, the date written, and any notes were recorded for each document.

2. Document Selection

Ten documents were selected to represent a variety of sources, sectors, and time periods; the documents most relevant to the theme were chosen.
The documents selected can be found here

3. Initial Analysis & Code Development

The 10 documents in the corpus were read a first time to determine major themes that emerged. A literature review was also conducted to determine possible biases that would influence the documents.
Possible Biases:
- Historic Army recruitment tests reinforced institutional bias and maintained segregation. Because of Jim Crow laws, Black recruits had not received the same education as their white peers.
- Until August 2014, a row of chairs was placed behind the female platoon at Marine recruit training for recruits who were too exhausted to stand, despite completing boot camp under the same conditions and requirements as their male peers.
- The “Don’t ask, don’t tell” policy, which barred openly LGBTQIA+ persons from serving in the military, was lifted on September 20, 2011. The law claimed, among other things, that their presence would “create a risk (…) to unit cohesion”.

4. Code Refinement

Initial and Emergent Codes

After all of the documents were read, the codes were refined so that multiple codes did not cover the same idea.

5. Primary Reading

Word Cloud

Once the codes were selected, each document was read in its entirety using Dedoose. There were 319 excerpts and 506 code applications.

6. Secondary Coder Test

Coder Test

Tests were conducted in Dedoose to measure inter-coder reliability. Readers outside the project took a test to determine whether their application of the codes matched that of the initial coder.

7. Cleaning

Lemmatization Example

To begin conducting natural language processing (NLP) on the documents, the corpus was uploaded as text files into RStudio. All of the documents were cleaned by making all words lowercase, removing non-letter characters, and removing extra white space. It was then decided that stop words (i.e., "the," "and," "of") and any word shorter than four letters should be removed. Last names that were commonly referenced in the documents were also removed. Then, all of the documents were lemmatized. Lemmatization groups together multiple forms of the same word so that they can be analyzed as a single concept.
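The project's cleaning was done in R; the steps above can be sketched in Python roughly as follows (the stop-word list and lemma table here are tiny stand-ins for the full resources used):

```python
import re

# Tiny stand-ins for the full stop-word list and lemmatization dictionary.
STOP_WORDS = {"the", "and", "of", "a", "to", "were", "was"}
LEMMAS = {"soldiers": "soldier", "performing": "perform", "performed": "perform"}

def clean(text):
    text = text.lower()                      # lowercase everything
    text = re.sub(r"[^a-z\s]", " ", text)    # drop non-letter characters
    tokens = text.split()                    # splitting also trims white space
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) >= 4]
    return [LEMMAS.get(t, t) for t in tokens]  # map each word to its base form

print(clean("The Soldiers were performing well!"))
# ['soldier', 'perform', 'well']
```

After cleaning, "Soldiers" and "performing" collapse to their base forms, so later counts treat all inflections of a word as one concept.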

8. Word Co-occurrence Test

The corpus was split into documents and then into paragraphs. The words in each paragraph were tokenized, and the paragraph to which each word belongs was recorded. The number of times each word pair occurred within a paragraph was counted, and the highest-frequency pairs were reported.
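In rough Python terms (the project code itself was written in R), counting paragraph-level word pairs looks something like this:

```python
from collections import Counter
from itertools import combinations

def word_pairs(paragraphs):
    """Count how often each unordered word pair shares a paragraph."""
    counts = Counter()
    for para in paragraphs:
        words = sorted(set(para.lower().split()))  # unique tokens, ordered
        counts.update(combinations(words, 2))      # every pair in the paragraph
    return counts

paragraphs = [
    "unit cohesion matters",
    "unit cohesion improves performance",
]
pairs = word_pairs(paragraphs)
print(pairs.most_common(1))  # [(('cohesion', 'unit'), 2)]
```

Sorting the unique tokens first means ("unit", "cohesion") and ("cohesion", "unit") are counted as the same pair.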

9. TF-IDF

The term frequency-inverse document frequency (TF-IDF) was computed for each document. It is calculated by multiplying the number of times a word appears in a given document by the inverse document frequency of that word across the corpus. The resulting set of words shows which terms make each document unique relative to the other documents in the corpus.

10. LDA

Latent Dirichlet Allocation is a form of topic modeling used to show the topics that emerge within a corpus of documents and the words associated with each topic. For this project, the model was set to produce 18 topics, as 18 codes were identified when manually reading the documents.
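The project fit its model in R; as an illustration of the underlying idea, a minimal collapsed Gibbs sampler for LDA can be sketched in Python (real analyses use a dedicated library such as R's topicmodels or Python's gensim, and the documents here are toy data):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler: assign each word token to a topic."""
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]                # topic counts per document
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics
    z = []  # current topic assignment of every token
    for di, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(n_topics)    # random initial topic
            z[di].append(t)
            doc_topic[di][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(n_iter):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]              # remove the token's current assignment
                doc_topic[di][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                weights = [                # P(topic | all other assignments)
                    (doc_topic[di][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + beta * V)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[di][wi] = t              # resample and record the new topic
                doc_topic[di][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word

docs = [["unit", "cohesion", "unit"], ["morale", "cohesion", "morale"]]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
```

The returned `doc_topic` counts give each document's topic mixture and `topic_word` gives the words associated with each topic, mirroring how the 18-topic model's output was read against the manual codes.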

11. Triangulation

Based on the LDA model, codes were matched with the topics to see whether the manually labeled documents coincided with the natural language processing results.



Interns

Skylar Haskiell

Skylar is a rising third year at the University of Virginia pursuing a B.S. in Computer Science.

Jillian Eberhart

Jillian is a rising third year at the University of Virginia pursuing a B.A. in Statistics with a minor in Health and Wellness.